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ABSTRACT 

In this paper, we propose a stochastic search algorithm for solving general optimization 
problems with little structure. The algorithm iteratively finds high quality solutions by 
randomly sampling candidate solutions from a parameterized distribution model over the 
solution space. The basic idea is to convert the original (possibly non-differentiable) prob- 
lem into a differentiable optimization problem on the parameter space of the parameterized 
sampling distribution, and then use a direct gradient search method to find improved sam- 
pling distributions. Thus, the algorithm combines the robustness feature of stochastic 
search from considering a population of candidate solutions with the relative fast conver- 
gence speed of classical gradient methods by exploiting local differentiable structures. We 
analyze the convergence and converge rate properties of the proposed algorithm, and carry 
out numerical study to illustrate its performance. 

1. Introduction 

We consider global optimization problems over real vector-valued domains. These opti- 
mization problems arise in many areas of importance and can be extremely difficult to 
solve due to the presence of multiple local optimal solutions and the lack of structural 
properties such as differentiability and convexity. In such a general setting, there is little 
problem-specific knowledge that can be exploited in searching for improved solutions, and 



it is often the case that the objective function can only be assessed through the form of 
"black-box" evaluation, which returns the function value for a specified candidate solution. 

An effective and promising approach for tackling such general optimization problems 
is stochastic search. This refers to a collection of methods that use some sort of ran- 
domized mechanism to generate a sequence of iterates, e.g., candidate solutions, and then 
use the sequence of iterates to successively approximate the optimal solution. Over the 
past years, various stochastic search algorithms have been proposed in literature. These 
include approaches such as simulated annealing [TO], genetic algorithms f7], tabu search 
[6], pure adaptive search [28j, and sequential Monte Carlo simulated annealing [29j, which 
produce a sequence of candidate solutions that are gradually improving in performance; 
the nested partitions method [25] , which uses a sequence of partitions of the feasible region 
as intermediate constructions to find high quality solutions; and the more recent class of 
model-based algorithms (see a survey by [30] ) , which construct a sequence of distribution 
models to characterize promising regions of the solution space. 

This paper focuses on model-based algorithms. These algorithms typically assume a 
sampling distribution (i.e., a probabilistic model), often within a parameterized family of 
distributions, over the solution space, and iteratively carry out the two interrelated steps: 
(1) draw candidate solutions from the sampling distribution; (2) use the evaluations of these 
candidate solutions to update the sampling distribution. The hope is that at every iteration 
the sampling distribution is biased towards the more promising regions of the solution space, 
and will eventually concentrate on one or more of the optimal solutions. Examples of model- 
based algorithms include ant colony optimization [HE], annealing adaptive search (AAS) 
[22], probability collectives (PCs) [27], the estimation of distribution algorithms (ED As) 
|14t [T9j . the cross-entropy (CE) method [23], model reference adaptive search (MRAS) 
[8], and the interacting-particle algorithm |17| I18j . The various model-based algorithms 
mainly differ in their ways of updating the sampling distribution. Recently, showed 
that the updating schemes in some model-based algorithms can be viewed under a unified 
framework. The basic idea is to convert the original optimization problem into a sequence 
of stochastic optimization problems with differentiable structures, so that the distribution 
updating schemes in these algorithms can be equivalently transformed into the form of 
stochastic approximation procedures for solving the sequence of stochastic optimization 
problems. 

Because model-based algorithms work with a population of candidate solutions at each 
iteration, they demonstrate more robustness in exploring the solution space as compared 
with their classical counterparts that work with a single candidate solution each time (e.g., 
simulated annealing). The main motivation of this paper is to integrate this robustness 
feature of model-based algorithms into familiar gradient-based tools from classical differen- 
tiable optimization to facilitate the search for good sampling distributions. The underlying 
idea is to reformulate the original (possibly non-differentiable) optimization problem into a 
differentiable optimization problem over the parameter space of the sampling distribution, 
and then use a direct gradient search method on the parameter space to solve the new 
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formulation. This leads to a natural algorithmic framework that combines the advantages 
of both methods: the fast convergence of gradient-based methods and the global explo- 
ration of stochastic search. Specifically, each iteration of our proposed method consists 
of the following two steps: (1) generate candidate solutions from the current sampling 
distribution; (2) update the parameters of the sampling distribution using a direct gra- 
dient search method. Although there are a variety of gradient-based algorithms that are 
applicable in step (2) above, in this paper we focus on a particular algorithm that uses a 
quasi-Newton like procedure to update the sampling distribution parameters. Note that 
since the algorithm uses only the information contained in the sampled solutions, it differs 
from the quasi-Newton method in deterministic optimization, in that there is an extra 
Monte Carlo sampling noise involved at each parameter updating step. We show that this 
stochastic version of quasi-Newton iteration can be expressed in the form of a generalized 
Robbins-Monro algorithm, and this in turn allows us to use the existing tools from stochas- 
tic approximation theory to analyze the asymptotic convergence and convergence rate of 
the proposed algorithm. 

The rest of the paper is organized as follows. We introduce the problem setting for- 
mally in Section [2| Section [3] provides a description of the proposed algorithm along with 
the detailed derivation steps. In Section [4j we analyze the asymptotic properties of the 
algorithm, including both convergence and convergence rate. Some preliminary numerical 
study are carried out in Section [5] to illustrate the performance of the algorithm. Finally, 
we conclude this paper in Section |6j All the proofs are contained in the Appendix. 

2. Problem Formulation 

Consider the maximization problem 



where the solution space X is a nonempty compact set in M"", and H : X ^ M is a real- 
valued function. Denote the optimal function value as H* , i.e., there exists an x* such that 
H{x) < H* = H{x*), Vx G X. Assume that H is bounded on X, i.e., 3Hib > — oo. Hub < 
oo s.t. Hih < H{x) < Hub, Vx G X . We consider problems where the objective function 
H{x) lacks nice structural properties such as differentiability and convexity and could have 
multiple local optima. 

Motivated by the idea of using a sampling distribution/probabilistic model in model- 
based optimization, we let {f{x]9)\9 G C M'^} be a parameterized family of probability 
density functions (pdfs) on X with G being a parameter space. Intuitively, this collection 
of pdfs can be viewed abstractly as probability models characterizing our knowledge or 
belief of the promising regions of the solution space. It is easy to see that 
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In this paper, we simply write J with the understanding that the integrals are taken over 
X. Note that the equality on the righthand side above is achieved whenever there exists 
an optimal parameter under which the parameterized probability distribution will assign 
all of its probability mass to a subset of the set of global optima of ([T]). Hence, one natural 
idea to solving (jl]) is to transform the original problem into an expectation of the objective 
function under the parameterized distribution and try to find the best parameter 6* within 
the parameter space Q such that the expectation under f{x,0*) can be made as large as 
possible, i.e., 

e* = argmax j H{x)f{x;e)dx. (2) 

So instead of considering directly the original function H(x) that is possibly non-differentiable 
and discontinuous in x, we now consider the new objective function J H{x)f{x; 9)dx that is 
continuous on the parameter space and usually differentiable with respect to 9. For exam- 
ple, under mild conditions the differentiation can be brought into the integration to apply 
on the p.d.f. f{x] 9), which is differentiable given an appropriate choice of the distribution 
family such as an exponential family of distributions. 

The formulation of ^ suggests a natural integration of stochastic search methods 
on the solution space X with gradient-based optimization techniques on the continuous 
parameter space. Conceptually, that is to iteratively carry out the following two steps: 

1. Generate candidate solutions from f{x;9) on the solution space X. 

2. Use a gradient-based method for the problem ^ to update the parameter 9. 

The motivation is to speed up stochastic search with a guidance on the parameter space, 
and hence combine the advantages of both methods: the fast convergence of gradient-based 
methods and the global exploration of stochastic search methods. Even though problem 
([2]) may be non-concave and multi- modal in 9, the sampling from the entire original space 
X compensates the local exploitation along the gradient on the parameter space. In fact, 
our algorithm developed later will automatically adjust the magnitude of the gradient 
step on the parameter space according to the global information, i.e., our belief about the 
promising regions of the solution space. 

For algorithmic development later, we introduce a shape function : M — t- M+, where 
the subscript 9 signifies the possible dependence of the shape function on the parameter 9. 
The function Sg satisfies the following conditions: 

(a) For every 9, Sg{y) is nondecreasing in y and bounded from above and below for 
bounded y, with the lower bound being away from zero. Moreover, for every fixed y, 
Se{y) is continuous in 9; 

(b) The set of optimal solutions {avgmaxxizx Sg{H{x))} is a non-empty subset of {avgniaxx^x H{x)}, 
the set of optimal solutions of the original problem ([T]) . 
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Therefore, solving ([T]) is equivalent to solving the following problem 

X* £ argmaxSg{H{x)). (3) 

The main reason of introducing the shape function Sq is to ensure positivity of the objective 
function Se{H{x)) under consideration, since Se{H{x)) will be used to form a probability 
density function later. Moreover, the choice of can also be used to adjust the trade-off 
between exploration and exploitation in stochastic search. One choice of such a shape 
function, similar to the level/indicator function used in the CE method and MRAS, is 

SeiHix)) = {H{x) - ^^5) y^^t^jW^' 

where Sq is a large positive constant, and 79 is the (1 — /9)-quantile 

7e = sup{/ : Pg{x G X : H{x) > 1} > p} , 
I 

where P$ denotes the probability with respect to /(•; 6). Notice that l/(^l-\-e~^°^^^^')~^^^) is 
a continuous approximation of the indicator function I{H(x) > 70}, this shape function Sq 
essentially prunes the level sets below 751. By varying p, we can adjust the percentile of elite 
samples that are selected to update the next sampling distribution: the bigger p, the less 
elite samples selected and hence more emphasis is put on exploiting the neighborhood of the 
current best solutions. Sometimes the function Sg could also be chosen to be independent 
of 9, i.e., Sq = S : ^ M^', such as the function S{y) = exp(?/). 
For an arbitrary but fixed 9' G M'^, define the function 

L{9;9') ^ j Sg,{H{x))f{x-9)dx. 

According to the conditions on Sg, it always holds that 

< L{9;9') < Sg'{H*) y9, 

and the equality is achieved if there exists an optimal parameter such that the probability 
mass of the parameterized distribution is concentrated only on a subset of the set of global 
optima. Following the same idea that leads to ([2]) , solving ^ and thus ([T]) can be converted 
to the problem of trying to find the best parameter 9* within the parameter space by solving 
the following maximization problem: 

r = argmaxL(0;e')- (5) 

6*60 

Same as problem ([2]), L{9\ 9') may be nonconcave and multi-modal in 9. 
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3. Gradient-Based Adaptive Stochastic Search 



Following the formulation in the previous section, we propose a stochastic search algorithm 
that carries out the following two steps at each iteration: let 6k be the parameter obtained 
at the k*^ iteration, 

1. Generate candidate solutions from f{x;9k). 

2. Update the parameter to O^+i using a quasi Newton's iteration for maxgL(^;0fc). 

Assuming it is easy to draw samples from /(x; 6), then the main obstacle is to find expres- 
sions of the gradient and Hessian of L{9; 6^) that can be nicely estimated using the samples 
from f{x;9). To overcome this obstacle, we choose {f{x;9)} to be an exponential family 
of densities defined as below. 

Definition 1. A family {f{x;9) : 9 G 0} is an exponential family of densities if it satisfies 



(6) 



where T{x) = [ri(x), T2(x), . . . , Tii{x)]^ is the vector of sufficient statistics, 9 = [9i,92, ■ ■ ■ , 9d\^ 
is the vector of natural parameters, and Q = {9 : \(t>{9)\ < oo} is the natural parame- 
ter space with a nonempty interior. 

Define the density function 



^ Se{Hix))fix-9) SeiHix))fix;9) 



' > JSe{H{x))f{x;9)dx m9) ' ^ ^ 

With f{-]9) from an exponential family, we propose the following updating scheme for 9 
in step 2 above: 

9k+i = 9k + afc(Varejr(X)] + el)-^ {Er>,[T{X)] - Ee,[T{X)]) , (8) 

where e > is a small positive number, > is the step size, Ep^ denotes the expectation 
with respect to p{-; 9k), and Eq^ and Varg^, denote the expectation and variance taken with 
respect to f{-;9k), respectively. The role of el is to ensure the positive definiteness of 
(Var0jr(X)] + el) such that it can be inverted. The term {Ep^[T{X)] - Ee^[T{X)]) is an 
ascent direction of L{9; 9^), which will be shown in the next section. 

To implement the updating scheme Q, the term Epf.[r{X)] is often not analytically 
available and needs to be estimated. Suppose {xi, . . . , xn^} are independent and identically 
distributed (i.i.d.) samples drawn from f{x;9k)- Since 



Ep^[T{X)]=Ee, 
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we compute the weights {wl} for the samples {x\} according to 

J (Xf^, "kj 

N 
i=l 

Hence, Epf, [T{X)] can be approximated by 

Ep,[T{X)]=Y,^iT{xl). (9) 

i=l 

Some forms of the function Sgi_{H{x)) have to be approximated by samples as well. For 
example, if S0^{H{x)) takes the form Q, the quantile 79^, needs to be estimated by the 
sample quantile. In this case, we denote the approximation by Sgi^{H{x)), and evaluate 
the normalized weights according to 

w'^^Se,{H{xi)), i = l,...,Nk. 

Then the term Ep^. [T{X)] is approximated by 

Ep,[T{X)]=J2^iT{xl). (10) 

i=l 



In practice, the variance term Varg^. [r(X)] in ([8]) may not be directly available or could be 
too complicated to compute analytically, so it also often needs to be estimated by samples: 

^ i=l ^ \i=l / \i=l / 

The expectation term Egj^[T{X)] can be evaluated analytically in most cases. For example, 
if {f{-;Ok)} is chosen as the Gaussian family, then £'g^[T(X)] reduces to the mean and 
second moment of the Gaussian distribution. 

Based on the updating scheme of 9, we propose the following algorithm for solving Q. 
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Algorithm 1 Gradient-Based Adaptive Stochastic Search (GASS) 



1. Initialization: choose an exponential family of densities {/(•; 6)}, and specify a small 
positive constant e, initial parameter sample size sequence {Nk}, and step size 
sequence {ofc}. Set A; = 0. 

2. Sampling: draw samples ~ f{x; 9k), i = 1,2,..., A^'^. 

3. Estimation: compute the normalized weights w\. according to 

4(^(4)) 



and then compute Ep^[T{X)] and Varej.[T(X)] respectively according to (10) and 

4. Updating: update the parameter 6 according to 

0k+i = [0k + ak{V^rg,[T{X)] + eI)-\%,[T [X)] - Ee,[T{X)])] , 

where B C is a non-empty compact connected constraint set, and IIq denotes 
the projection operator that projects an iterate back onto the set by choosing the 
closest point in 0. 

5. Stopping: check if some stopping criterion is satisfied. If yes, stop and return the 
current best sampled solution; else, set k := k + 1 and go back to step 2. 



In the above algorithm, at the k iteration candidate solutions are drawn from the 
sampling distribution f{-;6k), and then are used to estimate the quantities in the updating 
equation for 9k so as to generate the next sampling distribution f[-;6k+i)- To develop 
an intuitive understanding of the algorithm, we consider the special setting T{X) = X, 
in which case the term Varg^, [T(X)] basically measures how widespread the candidate 
solutions are. Since the magnitude of the ascent step is determined by (Vare^, [T(X)]+e/)~^, 
the algorithm takes smaller ascent steps to update 6 when the candidate solutions are more 
widely spread (i.e., Var0j.[X] is larger), and takes larger ascent steps when the candidate 
solutions are more concentrated (i.e., Varg^[X] is smaller). It means that exploitation of 
the local structure is adapted to our belief about the promising regions of the solution 
space: we will be more conservative in exploitation if we are uncertain about where the 
promising regions are and more progressive otherwise. Note that the projection operator 
at step 4 is primarily used to ensure the numerical stability of the algorithm. It prevents 
the iterates of the algorithm from becoming too big in practice and ensures the sequence 
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{9k} to stay bounded as the search proceeds. For simphcity, we wiU assume that is a 
hyper-rectangle and takes the form Q = {6 £ Q : ai < 6i < bi} for constants Oj < bi, 
i = 1,. . . ,d; other more general choices of G may also be used (see, e.g., Section 4.3 of 
[H]). Intuitively, such a constraint set should be chosen sufficiently large in practice so 
that the limits of the recursion at step 4 without the projection are contained in its interior. 



3.1 Accelerated GASS 

GASS can be viewed as a stochastic approximation (SA) algorithm, which we will show in 
more details in the next section. To improve the convergence rate of SA algorithms, [20j 
and [24j first proposed to take the average of the 6 values generated by previous iterations, 
which is often referred to as Polyak (or Polyak-Ruppert) averaging. The original Polyak 
averaging technique is "offline" , i.e., the averages are not fed back into the iterates of 0, and 
hence the averages are not useful for guiding the stochastic search in our context. However, 
there is a variation, Polyak averaging with online feedback (c.f. pp. 75 - 76 in [l3]), which 
is not optimal as the original Polyak averaging but also enhances the convergence rate of 
SA. Using the Polyak averaging with online feedback, the parameter 9 will be updated 
according to 

Ok+i = l^fc + afc (vTr,, [T{X)] + el) [T{X)] - Ee, [T{X)]) + a,,c(4 - ^fe)} , 

(12) 

where the constant c is the feedback weight, and 9k is the average 

1 ^ 

Ok = l^0^, 

1=1 

which can be calculated recursively by 

4 = ^4-i + |. (13) 
With this parameter updating scheme, we propose the following algorithm. 



Algorithm 2 Gradient-based Adaptive Stochastic Search ^'itYi Averaging 
(GASS_avg) 

Same as Algorithm [T] except in step 4 the parameter updating follows (12) and (13). 



3.2 Derivation 

In this subsection, we explain the rationale behind the updating scheme ([s]). We first derive 
the expressions of the gradient and Hessian of L{9] 9') as given below. 
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Proposition 1. Assume that f{x;9) is twice dijferentiable on Q and that Ve/(x;^) and 
Vg/(x; 9) are both bounded on X for any 6 £ @. Then 

Verne') = Ee[Se'{H{X))Ve\nf{X-9)] 
Vlme') = Ee[Se>{H{X))Vl\nf{X-e)] 

+ Ee[Se' {H{X))Ve In /(X; e)Ve In f{X- Of]. 

Furthermore, if f{x; 6) is in an exponential family of densities defined by then the 
above expressions reduce to 

Verne') = Ee[Se'{H{X))T{X)]-Ee[Se>{H{X))]Ee[T{X)l 
Vime') = Ee[Se>{H{X)){T{X)-Ee[T{X)]){T{X)-Ee[T{X)])^] 
- Ya.Tg[T{X)]Ee[Sg>iH{X))]. 

Notice that if we were to use Newton's method to update the parameter 9, the Hessian 
VgL(^; e') is not necessarily negative semidefinite to ensure the parameter updating is 
along the ascent direction of L(9;e'), so we need some stabilization scheme. One way 
is to approximate the Hessian by the second term on the righthand side with a small 
perturbation, i.e., —{YaTg[T{X)] + eI)Ee[Se'{H{X))], which is always negative definite. 
Thus, the parameter 6 could be updated according to the following iteration 

9k+i = 9k + ak{{Yave,[T{X)] + eI)Ee,[Se,{H{X))])-^VeL{ek;ek), (14) 

= 9, + a,{Y.re,[nX)] + eI) [-^^JSejHmT ~ ^ ' 

which immediately leads to the updating scheme ([s]) given before. 

In the updating equation ([s]), the term Eg^ [Sg^, {H{X))]~^ is absorbed into VgL^ef^; 6^), 
so we obtain a scale- free term {Epf, [T(X)] — Ee^. \T{X)\) that is not subject to the scaling of 
the function value of Sej,[H{x)). It would be nice to have such a scale-free gradient so that 
we can employ other gradient-based methods more easily besides the above specific choice 
of a quasi-Newton method. Towards this direction, we consider a further transformation 
of the maximization problem ^ by letting 

l{e-e') = lnL{9;9'). 

Since In : — t- i? is a strictly increasing function, the maximization problem ^ is 
equivalent to 

9* = avgmaxl{e;e'). (15) 

em'' 

The gradient and the Hessian of l{9;9') are given in the following proposition. 
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Proposition 2. Assume that f{x;9) is twice dijferentiable on Q and that Ve/(x;^) and 
Vg/(x; 9) are both bounded on X for any 9 ^ @. Then 

Vem9')\e=e' = Ep^.-eni'^eln f{X;9')] 

Vlm9')\e=e' = Ep^..e')Nllnf{X;9')]+YaVp^..e') [Vein f{X;9')] . 

Furthermore, if f{x;9) is in an exponential family of densities, then the above expressions 
reduce to 

Vem9')\e=e' = Ep(^..e')[nx)] - Ee'[T{X)], 
Vlm9')\e=e' = Varp(,e,)[T(X)]-Var,4T(X)]. 

Similarly as before, noticing that the Hessian VqI{9'; 9') is not necessarily negative def- 
inite to ensure the parameter updating is along the ascent direction of l{9; 9'), we approxi- 
mate the Hessian by the slightly perturbed second term in Vgl{9'] 9'), i.e., — (Vare/[T(X)] + 
el). Then by setting 

9k+i =9k + ak (Vare,[r(X)] + eiy^ Vel{9k; 9k), 



we again obtain exactly the same updating equation (|8|) for 9. The difference from (14) is 
that the gradient Vgl{9; 9') is a scale-free term, and hence can be used in other gradient- 
based methods with easier choices of the step size. From the algorithmic viewpoint, it is 
better to consider the optimization problem (15) on l{9; 9') instead of the problem ^ on 
L{9;9'), even though both have the same global optima. 

Although there are many ways to determine the positive definite matrix in front of the 
gradient in a quasi-Newton method, our choice of (Var^^ [T(X)] + el)~^ is not arbitrary but 
based on some principle. Without considering the numerical stability and thus dropping 
the term el, the term YaTg[T{X)] = E[Vgln f{X ■,9){Vg In f{X; 9))^] = E[-Vjln f{X;9)] 
is the Fisher information matrix, whose inverse provides a lower bound on the variance of 
an unbiased estimator of the parameter 9 ([2Tj), leading to the fact that (yaicg[T{X)])~^ 
is the minimum- variance step size in stochastic approximation (jl6]). Moreover, from 
the optimization perspective, the term (yar:g[T{X)])~^ relates the gradient search on the 
parameter space with the stochastic search on the solution space, and thus adaptively 
adjusts the updating of the sampling distribution to our belief about the promising regions 
of the solution space. To see this more easily, let us consider T{X) = X. Then (Vare[X])~^ 
is smaller (i.e., the gradient step in updating 9 is smaller) when the current sampling 
distribution is more flat, signifying the exploration of the solution space is still active and 
we do not have a strong belief (i.e. f{-;9)) about promising regions; (Vare[X])~^ is larger 
(i.e., the gradient step in updating 9 is larger) when our belief f{-;9) is more focused on 
some promising regions. 
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4. Convergence Analysis 



We will analyze the convergence properties of GASS, resorting to methods and results in 
stochastic approximation (e.g., [121 [131 [1] ) . In GASS, S/ol{6; 9k)\e=ek is estimated by 

V0l{0k;ek) = ^pjT(X)] - Eg^[T{X)]. (16) 

To simplify notations, we denote 

% = Y^re,[T{X)] + eI, = Yare,[T{X)] + el . 

Hence, the parameter updating iteration in GASS is 

= He {Ok + akV^^Vem; ^fc)} , (17) 
which can be rewritten in the form of a generalized Robbins-Monro algorithm 

ek+i = ek + ak[D{9k) + bk+ik + Zk], (18) 

where 



DiOk) = (Var,jr(X)] + e/)-We/(efc;0, 



h = V^'[E,^[T{X)]-E,^[T{X)]y 

Ck = - V,') {Ep,[T{X)] - Ee,[T{X)]) + V,-' (^pjr(X)] - E,,[T{X)]) , 

and Zk is the projection term satisfying akZk = 0k+i — 0k — ak[D{6k)+bk+6.k], the minimum 
Euclidean length vector that takes the current iterate back onto the constraint set. The 
term D{9k) is the gradient vector field, bk is the bias due to the inexact evaluation of the 
shape function in Epi,[T{X)] (bk is zero if the shape function can be evaluated exactly), 
and ^k is the noise term due to Monte Carlo sampling in the approximations Varg^. [T(X)] 
and^pjr(X)]. 

For a given 9 £ Q, we define a set C{9) as follows: if 9 lies in the interior of 0, let 
C{9) = {0}; if 9 lies on the boundary of Q, define C{9) as the infinite convex cone generated 
by the outer normals at 9 of the faces on which 9 lies ( |13| pp. 106). The difference equation 



( 18 ) can be viewed as a noisy discretization of the constrained ordinary differential equation 
(ODE) 

9t = D{9t) + zu zte-C{9t), t>0, (19) 
where zt is the minimum force needed to keep the trajectory of the ODE in Q. Thus, 



the sequence of {9k} generated by (18) can be shown to asymptotically approach the 



solution set of the above ODE (19) by using the well-known ODE method. Let 



denote the vector supremum norm (i.e., ||x|| = max{|xj|}) or the matrix max norm (i.e.. 
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Il^ll = max{|aij|}). Let || • II2 denote the vector 2-norm (i.e., ||x|| = ^y x1 + . . . + x^) or the 
matrix norm induced by the vector 2-norm (also called spectral norm for a square matrix, 
i.e., \\A\\2 = sj Amax(^*^)! where A* is the conjugate transpose of A and \max returns the 
largest eigenvalue). 

To proceed to the formal analysis, we introduce the following notations and assump- 
tions. We denote the sequence of increasing sigma-fields generated by all the samples up 
to the iteration by 

= a ({x^^^i, . . . , , = 0, 1, . . .} . 

Define notations 

i=l i=l 

■■= jfY.Se,[H[x^,))T{xl), % :=—Y^Se,{H{xl)) 

1=1 1=1 
Ufe := Ee,[Se,{HiX))T{X)], := Ee,[Se,{H{X))]. 



Assumption 1. 

(i) The step size sequence {ak} satisfies Ok > for all k, Ok \ as k ^ 00, and 
k=o ak = 00. 

(a) The sample size = N^k'^ , where C > 0; moreover, {ak} o.nd {N^} jointly satisfies 

= 0{k~^) for some constant /3 > 1. 
(Hi) The function x 1— >■ T(x) is bounded on X . 
(iv) For any x, \So^{H{x)) — Sg^{H{x))\ — )• w.p.l as Nk — )• 00. 

In the above assumption, (i) is a typical assumption on the step size sequence in SA, 
which means that diminishes not too fast. Assumption [T|ii) provides a guideline on 
how to choose the sample size given a choice of the step size sequence, and shows that the 
sample size has to increase to infinity no slower than a certain speed. For example, if we 
choose ak = ao^~" with < a < 1, then it is sufficient to choose = 0{k'^^^~°'^). As- 
sumption [Tjj^iii) holds true for many exponential families used in practice. Assumption [Tj^iv) 
is a sufficient condition to ensure the strong consistency of estimates, and is satisfied by 
many choices of the shape function Sg. For example, it is trivially satisfied if So = S, since 
S{H{x)) can be evaluated exactly for each x. If Sg takes the form of Q, Assumption [Tj^iv) 
is also satisfied, as shown in the following lemma. 

Lemma 1. Suppose the shape function takes the form 

Se,iHix)) = iHix)-Hi,) ^ 
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where 751^. = supj {/ : Pe^{x G X : H{x) > 1} > p} is the unique {1 — p)-quantile with respect 
to /(•; 9k)- If Sef,{H{x)) is estimated by S0i,{H{x)) with the true quantile jg^, being replaced 
by the sample (1 — p)-quantile 79^, = f/'(|-(;^_p)jVfe])) where \a~\ is the smallest integer greater 
than a, and is the i^^ order statistic of the sequence {H{x\), i = 1, . . . , N^}- Then un- 



der the condition Nk 
w.p.l as A; — )• 00. 



Q{k'^) C > 0, we have that for every x, \Sgi,{H{x)) — Se^.{H{x))\ 



The next lemma shows that the summed tail error goes to zero w.p.l. 
Lemma 2. Under Assumption^ (i)-(iii), for any T > 0, 



lim < sup 



E 

i=k 



0, W.p.l. 



Theorem [T] below shows that GASS generates a sequence {6^} that asymptotically 
approaches the limiting solution of the ODE ( 19 ) under the regularity conditions specified 
in Assumption [T| 

Theorem 1. Assume that D{6t) is continuous with a unique integral curve (i.e., the ODE 
has a unique solution 0{t) ) and Assumption^holds. Then the sequence {Ok} generated 
by converges to a limit set of (igp w.p.l. Furthermore, if the limit sets of are 
isolated equilibrium points, then w.p.l {Ok} converges to a unique equilibrium point. 

For a given distribution family, Theorem 1 shows that our algorithm will identify a 
local/global optimal sampling distribution within the given family that provides the best 
capability in generating an optimal solution to ([T]). From the viewpoint of maximizing 
Eq[H[X)], the average function value under our belief of where promising solutions are 
located (i.e., the parameterized distribution f{x,0)), the convergence of the algorithm to 
a local/global optimum in the parameter space essentially gives us a local/global optimum 
of our belief about the function value. 



4.1 Asymptotic Normality of GASS 

In this section, we study the asymptotic convergence rate of Algorithm 1 under the assump- 
tion that the parameter sequence {Ok} converges to a unique equilibrium point 9* of the 



ODE (19) in the interior of Q. This indicates that there exists a small open neighborhood 



J\f{9*) of 9* such that the sequence {9k} will be contained in J\f{9*) for k sufficiently large 



w.p.l. Thus, the projection operator in ( |17[ ) and Zk in (18) can be dropped in the analysis, 
because the projected recursion will behave identically to an unconstrained algorithm in 
the long run. Define C{9) = Vg'l(9';9)\g'=g and let Jc be the Jacobian of C. Under our 



conditions, it immediately follows from (19) that C{9*) = {0} and C{0*) = 0. Since C is 
the gradient of some underlying function F(9), Jc is the Hessian of F and Algorithm 1 is 
essentially a gradient-based algorithm for maximizing F(0). Therefore, it is reasonable to 
expect that the following assumption holds: 
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Assumption 2. The Hessian matrix Jc{S) is continuous and symmetric negative definite 
in the neighborhood N {6*) of 6* . 

We consider a standard gain sequence = ao/k"^ for constants ao > and < a < 1, 
a polynomially increasing sample size = Nok'^ with Nq > 1 and ( > 0. 

By dropping the projection operator in (17), we can rewrite the equation in the form: 

4+1 = Sk + fc-"a>fc£(0fe) + k-^^k (S^ - S^) , 

where 5k = Ok - 9* and <^>k = ao(yar0^ {T{X)) + e/)^^ Next, by using a first order Taylor 
expansion of C{9k) around the neighborhood of 6* and the fact that C{6*) = 0, we have 

5k+i = 5k + k-'^^kJcmSk + fc^^^fcfS^ - 

where 9k lies on the line segment from 9k to 9*. For a given positive constant r > 0, the 
above equation can be further written in the form of a recursion in [5J: 

5k+i = {I- k-^Tk)5k + k-^^+-y^<^kWk + k-^~-/^Tk, 

where Tk = -<^kM9k), Wk = A;(-")/2(| - Eg^ and = W2<D,(|i - & + 

^^fc [f^l'^'^'^i] ~ V^)' '^'^^ basic idea of the rate analysis is to show that the sequence 
of amplified differences {k'^^'^6k} converges in distribution to a normal random variable 
with mean zero and constant covariance matrix. To this end, we show that all sufficient 
conditions in Theorem 2.2 in [5] are satisfied in our setting. We begin with a strengthened 
version of Assumption l{iv). 

Assumption 3. 

For a given constant r > and x £ X, k'^^'^\S0f,{H{x)) — Sgf.{H{x))\ — )• as k ^ oo w.p.l. 

Assumption 3 holds trivially when Sq is a deterministic function that is independent 
of 9. In addition, if sample quantiles are involved in the shape function and Sg^{H{x)) 
takes the form Q, then the assumption can also be justified under some additional mild 
regularity conditions; cf. e.g., [9j. 

Let $ = Qo(Var0.(r(A)) + el)'^ and T = -^Jc{9*). The following result shows 
condition (2.2.1) in Theorem 2.2 of [5]. 

Lemma 3. Assume Assumptions 1 and 2 hold, we have <l>fc — t- <1> and — )• F as k ^ oo 
w.p.l. In addition, if Assumption l(iv) is replaced with Assumption 3 and Nk = N^k^ with 
Q > t/2, then Tk ^ as A; —>• oo w.p.l. 

In addition, the noise term Wk has the following property, which justifies condition 
2.2.2 in 0. 
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Lemma 4. Egi^[Wk\J~k-i] = 0. In addition, let t be a given constant satisfying t > a. If 
Assumption 1 holds and = N^k'^"'^ , then there exists a positive semi- definite matrix S 
such that \iink^^Ee^[WkW^\Fk-i] = S w.p.l, and Mmu^^ E[I{\\Wkf > rk'^]\\Wkf] = 
Vr > 0. 

The following asymptotic normality results then follows directly from Theorem 2.2 in 

U- 

Theorem 2. Let Ok = ao/k" for < a < 1. For a given constant r > 2a, let Nj. = 
NQk'^~"). Assume the convergence of the sequence {0^} occurs to a unique equilibrium 
point 6* w.p.l. If Assumptions 1, 2, and 3 hold, then 

ki{ek - e*) N{<d,QMQ^), 

where Q is an orthogonal matrix such that { — J c{(^*))Q = ^ with A being a diagonal 
matrix, and the {i,jY^ entry of the matrix Ai is given by A4(^ij^ = (Q^$E$"^Q)(jj)(A(j j) + 

Theorem 2 shows the asymptotic rate at which the noise caused by Monte-Carlo random 
sampling in GASS will be damped out as the number of iterations k — )• oo. This rate, as 
indicated in the theorem, is on the order of 0{l/^/k^). This implies that the noise can be 
damped out arbitrarily fast by using a sample size sequence {N^} that increases sufficiently 
fast as A; — )• oo. However, we note that this rate result is stated in terms of the number of 
iterations k, not the sample size Nj.. Therefore, in practice, there is the need to carefully 
balance the tradeoff between the choice of large values of to increase the algorithms's 
asymptotic rate and the use of small values of Nk to reduce the per iteration computational 
cost. 

5. Numerical Experiments 

We test the proposed algorithms GASS, GASS_avg on some benchmark continuous opti- 
mization problems selected from [8] and [9]. To fit in the maximization framework where 
our algorithms are proposed, we take the negative of those objective functions that are 
originally for minimization problems. The ten benchmark problems are listed as below. 

(1) Dejong's 5th function (n=2, 
Hi{x) = - 

where aji = (-32, -16, 0, 16, 32, -32, -16, 0, 16, 32, -32, -16, 0, 16, 32, -32, -16, 0, 16, 
32, -32, -16, 0, 16, 32) and aj2 = (-32, -32, -32, -32, -32, -16, -16, -16, -16, -16, 0, 
0, 0, 0, 0, 16, 16, 16, 16, 16, 32, 32, 32, 32, 32). The global optimum is at x* = (-32, -32)^, 
and H* ^ -0.998. 



-50 <xi< 50) 



25 



0.002 + 
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(2) Shekel's function (n=4, < < 10 ) 

5 

H2{x) = ^{{x - ai)'^{x - Qi) +Ci) , 

i=l 

where ai = (4,4,4,4)^, 02 = (1,1,1,1)^, 03 = (8,8,8,8)^, 04 = (6,6,6,6)^, 05 = 
(3, 7, 3, 7f, and c = (0.1, 0.2, 0.2, 0.4, 0.4). x* = (4, 4, 4, 4)^, H* f« 10.153. 

(3) Powel singular function (n=50, —50 < Xj < 50) 

n-2 

i73(a;) = - ^ [{xi-i + lOxi)^ + 5(xj+i - Xi+2f + {xi - 2xi+i)^ + 10(xi_i - Xi+2)^] -1, 

i=2 

where x* = (0, • • • , 0)^, H* = -1. 

(4) Rosenbrock function (n=10, — 10 < Xj < 10) 

n— 1 

H4{x) = -Yl [I00(xi+i - xf )2 + (xi - 1)2] - 1, 

i=l 

where x* = (1, • • • , 1)^, H* = -1. 

(5) Griewank function (n=50, —50 < Xj < 50) 

n n ^ s 

^=w=-i^E-?+n-(^)-i. 

i=l 1=1 ^ ^ ' 

where x* = (0, • • • , 0)^, H* = 0. 

(6) Trigonometric function (n=50, —50 < Xj < 50) 

n 

Hq{x) = - ^ [8 sin2(7(xi - 0.9)^) + 6 sin2(14(xi - 0.9)^) + (x^ - 0.9)^] - 1, 
1=1 

where x* = (0.9, • • • , 0.9)^, H* = -1. 

(7) Rastrigin function (n=20, -5.12 < x^ < 5.12) 

n 

Hj{x) = - ^ [xf - 10cos(27rxi)) - lOra - 1, 

i=l 

where x* = (0, • • • , 0)^, H* = -1. 
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(8) Pinter's function (n=50, -50 < Xi < 50) 



ixf + 20i sin^ sin Xj — + sin a^j+i ] 

i=l i=l 
n 

+ ^ i logiQ(l + i{xl_i - 2xi + Sxj+i - cos Xj + l)' 



where x* = (0, • • • , 0)"^, H* = -1. 
(9) Levy function (n=50, -50 < Xj < 50) 

n-l 



ff9(x) = -sin2(7ryi)-J]; [(y^ - 1)2(1 + 10 sin2(7ryi + 1))] -(y„-l)2(l+10 sin2(27ry„))-l, 



1=1 



where = 1 + (x^ - l)/4, x* = (1, • • • , 1)^, H* = -1. 
(10) Weighted Sphere function (n=50, —50 < Xj < 50) 

n 

Hio{x) = -^ixf - 1 

i=l 

where x* = (0, • • • , 0)"^, H* = -1. 

Specifically, Dejong's 5th (Hi) and Shekel's (-^2) are low-dimensional problems with a small 
number of local optima that are scattered and far from each other; Powel {H3) and Rosen- 
brock (-^4) are badly-scaled functions; Griewank (-^5), Trigonometric {Hq), and Rastrigin 
(-ffy) are high-dimensional multimodal problems with a large number of local optima, and 
the number of local optima increases exponentially with the problem dimension; Pinter 
(Hs) and Levy (^^9) are both multimodal and badly-scaled problems; Weighted Sphere 
function (Hiq) is a high-dimensional concave function. 

We compare the performance of GASS and GASS_avg with two other algorithms: the 
modified version of the CE method based on stochastic approximation proposed by [9J and 
the MRAS method proposed by [H]. In our comparison, we try to use the same parameter 
setting in all four methods. The common parameters in all four methods are set as follows: 
the quantile parameter is set to be p = 0.02 for low-dimensional problems Hi and H2, 
and p = 0.05 for all the other problems; the parameterized exponential family distribution 
/(x; 9k) is chosen to be independent multivariate normal distribution M{pk^ ^fc); the initial 
mean /io is chosen randomly according to the uniform distribution on [-30,30]"", and the 
initial covariance matrix is set to be Sq = lOOO/^xn, where n is the dimension of the 
problem; the sample size at each iteration is set to be = 1000. In addition, we observe 
that the performance of the algorithm is insensitive to the initial candidate solutions if the 
initial variance is large enough. 
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In GASS and GASS_avg, we consider the shape function of the form Ml), i.e., 



Se,{Hi.)) = (Hi.) - H,.):^^^shm^y 

In our experiment, is set to be 10^, which makes Sq^{H{x)) a very close approximation 
to {H{x) — Hih)I{H{x) > 76ifc}; the (1 — /9)-quantile 70^ is estimated by the (1 — p) sample 
quantile of the function values corresponding to all the candidate solutions generated at 
the k^'^ iteration. We use the step size: ak = ao/k", where ao reflects the initial step size, 
and the parameter a should be between and 1. We set ao = 0.3 for the low-dimensional 
problems Hi and H2 and the badly-scaled problem H4, and set ao = 1 for the rest of 
the problems; we set a = 0.05, which is chosen to be relatively small to provide a slowly 
decaying step size. With the above setting of step size, we can always find a (3 such that the 
sample size = 1000 satisfies the Assumption [l](ii) under a finite number of iterations, 
e.g. k < 2500 in our experiment. In GASS_avg, the feedback weight is c = 0.002 for 
problems H3, i/4 and and c = 0.1 for all other problems. 

In the modified CE method, we use the gain sequence = 5/(/c + lOO)*^'^"^, which is 
found to work best in the experiments. In the implementation of MRAS method, we use a 
smoothing parameter when updating the parameter 9^ of the parameterized distribution, 
and set u = 0.2 as suggested by [8]. The rest of the parameter setting for MRAS is as 
follows: A = 0.01, r = 10~^ in the shape function S{H(x)) = exp{rH{x)} . Other than 
using an increasing sample size in [9] and [8], and updating quantile pfc in f8], the constant 
sample size = 1000 and a constant p are used in our experiments for a fair comparison 
of all the methods. 







GASS 


GASS_avg 


modified CE 


MRAS 


H* 


H* (std.err) 




H* {std_err) 


A4 


H* {std_err) 




H* (std.err) 


Me 


Dejong's 5th Hi 


-0.998 


-0.998{4.79E-7) 


100 


-0.998(8. 97E-7) 


100 


-1.02(0.014) 


95 


-0.9981(6.63E-4) 


98 


Shekel H2 


10.153 


9.92(0.114) 


96 


9.91(0.106) 


95 


10.153(1. 09E-7) 


79 


9.90(0.126) 


96 


Powel H3 


-1 


-l(1.48E-6) 


100 


-l(1.89E-6) 


100 


-l(8.87E-9) 


100 


-1.50(0.433) 


95 


Rosenbrock H4 


-1 


-1.03(1.40E-4) 





-1.09(0.0301) 


46 


-1.91(0.016) 





-7.10(0.629) 





Griewank H5 





0(8.45E-15) 


100 


0(7.30E-15) 


100 


-0(3.02E-16) 


100 


-0.14(0.017) 


57 


Trigonometric He 


-1 


-1(9.72E-13) 


100 


-1(1.08E-12) 


100 


-1(2.23E-18) 


100 


-l(4.69E-7) 


100 


Rastrigin Hr 


-1 


-1.15(0.0357) 


85 


-1.19(0.044) 


83 


-1.01(0.0099) 


99 


-83.45(0.634) 





Pinter Hg 


-1 


-1.007(0.0034) 


93 


-1.04(0.0104) 


63 


-6.08(0.0254) 





-530.4(48.64) 


2 


Levy Hg 


-1 


-1(9.56E-13) 


100 


-l(1.29E-7) 


100 


-1. 063(3. 87E-18) 


100 


-1(1.42E-10) 


100 


Sphere Hiq 


-1 


-1(1.79E-11) 


100 


-1(1.42E-11) 


100 


-1(2.23E-18) 


100 


-l(9.95E-9) 


100 



Table 1: Comparison of GASS, GASS_avg, modified CE and MRAS 



In the experiments, we found the computation time of function evaluations dominates 
the time of other steps, so we compare the performance of the algorithms with respect to 
the total number of function evaluations, which is equal to the total number of samples. 
The average performance based on 100 independent runs for each method is shown in Table 
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[l| where H* is the true optimal value of H{-); H* is the average of the function values 
returned by the 100 runs of an algorithm; std^err is the standard error of these 100 function 
values; is the number of e-optimal solutions out of 100 runs (e-optimal solution is the 
solution such that H* — H* < e, where H* is the optimal function value returned by an 
algorithm). We consider e = 10~^ for problems H^, H-j, Hg and e = 10~^ for all other 
problems. Fig. [ijand Fig. [2] show the average (over 100 runs) of best value of H{-) at the 
current iteration versus the total number of samples generated so far. 

From the results, GASS and GASS.avg find all the e-optimal solutions in 100 runs for 
problems Hi, H^, H^, Hq, Hg, and Hiq. Modified CE finds all the e-optimal solutions for 
problems H^, H^, Hq, Hg, and Hiq. MRAS only finds all the e-optimal solutions for the 
problems Hq and Hg and the convex problem Hiq. As for the convergence rate, GASS_avg 
always converges faster than GASS, verifying the effectiveness of averaging with online 
feedback. Both GASS and GASS_avg converge faster than MRAS on all the problems, and 
converge faster than the modified CE method when oq is set to be large, i.e. on problems 
Hs and H5 - Hiq. 

6. Conclusion 

In this paper, we have introduced a new model-based stochastic search algorithm for solv- 
ing general black-box optimization problems. The algorithm generates candidate solutions 
from a parameterized sampling distribution over the feasible region, and uses a quasi- 
Newton like iteration on the parameter space of the parameterized distribution to find 
improved sampling distributions. Thus, the algorithm enjoys the fast convergence speed 
of classical gradient search methods while retaining the robustness feature of model-based 
methods. By formulating the algorithm iteration into the form of a generalized stochastic 
approximation recursion, we have established the convergence and convergence rate re- 
sults of the algorithm. Our numerical results indicate that the algorithm shows promising 
performance as compared with some of the existing approaches. 

A. Appendix 

Proof. Proposition 111 Consider the gradient of L{9; 9') with respect to 9, 



where the interchange of integral and derivative in the first equality follows from the bound- 
edness assumptions on Sgi and 'Vgf{x;9) and the dominated convergence theorem. 




Ee[Se'{H{X))Velnf{X;9)], 



(20) 
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Consider the Hessian of L{6; 6') with respect to 9, 
Vlme') = [ Se'{H{x))Vlf{x-9)dx 



Se'{H{x))f{x;9)Vilnf{x;9)dx + j Se'{H{x))Velnf{x;9)Vef(.x;9ydx 

= Ee[Se'{H{X))Vl lnf{X; 9)] + Ee[Se'{H{X))Ve lnf{x; 9)Ve hi/(x; 0)^](21) 

where the last equahty follows from the fact that Vef{x;9) = f {x;9)Vg In f{x; 9). 
Furthermore, if /(x; 9) = exjp{9'^T{x) — 4'{9)}, we have 

Ve\nf{x;9) = (9'^T{x) -In j exp{9'^T{x))dx 

JeM^^T{x))T{x)dx 



T{x) - Eg[T{X)]. 



Plugging (22) into (20) yields 



Vem9') = Ee[S9'{H{X))T{X)] - Ee[Se'{H{X))]Eg[T{X)]. 



Differentiating (22) with respect to 9, we obtain 

/ exp{9'^T{x))T{x)T{xfdx 



V^ln/(x; 



+ 



J exp{9'^T{x))dx 
J exj>{9^T{x))T{x)dx (/ expi9^T{x))T{x)da 
(/exp(rT(x))dx)^ 
-Ee[T{X)T{Xf] + Ee[T{X)]Ee[T{X)f 
-Vare[r(X)]. 



Plugging (|22j) and (|23|) into (|21j) yields 

Vjm9') = Ee[Se.{H{X)){T{X)-Ee[T{X)]){T{X)-Ee[T{X)]f] 
- \8Je[T{X)]Ee[Sei{H{X))]. 



^el{e;9') 



Proof. Proposition [2| Consider the gradient of l{9;9') with respect to 9, 

VeLi9;9') 
L{9;9') 

J Se'{H{x))f{x-9)Ve I n f{x;9)dx 
L{9;9') 
= Ep^..g,)[Velnf{X-9')]. 



(22) 



(23) 



□ 



(24) 
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Differentiating (24) with respect to 0, we obtain the Hessian 



/ Se>{H{x))f{x- e)Vl In /(x; e)dx , / Se>{H{x))Ve In /(x; ^)(Ve/(x; 0))^dx 



+ 



L{e-e') L{e;e') 

if Se^{H{x))f{x- 0)Ve In f{x; d)dx){VgL{e; d') f 



■ 0'\2 



Using 'SJgf^x] 6) = f{x; 9)Vq lnf{x; 9) in the second term on the righthand side, the above 
expression can be written as 

Vll{e-e')\e=e' = Ep^..e')Nllnf{X;e')]+Ep^..e') [Ve'lnf{X;e'){Ve'lnf{X;0')f] 
- Ep^..o') [VelnfiX-e')] [Vein f{X;9')f 

= Ep^..g,) [Vl In f{X; 6')] + YaVp^.-e') [Ve In f{X; 6')] . (25) 



Furthermore, if f{x;d) = exp{0^T(x) — 4>{9)}, plugging (22) into (24) yields 
Vem9')\e=e' = Ep^..g,)[T{X)] - Eg,[T{X)], 



and plugging (22) and (23) into (25) yields 

V^/(e;^')|e=e' = Varp(,,,)[r(X)] - Var,,[r(X)]. 



□ 



Proof. Lemma [T| Because Sq is continuous in jg, it is sufficient to show that 79^. — t- 7^^. 
w.p.l as A; — )• 00, which can be shown in the same way as Lemma 7 in |8], except that we 
need to verify the following condition in their proof: 



00, 



k=l 



where M is positive constant. It is easy to see that this condition is trivially satisfied in 
our setting by taking = Nok'' with C > 0. □ 

Proof. Lemma [2], Before the formal proof of Lemma [2| we first introduce a key inequality 
to our proof - the matrix bounded differences inequality ([26]), which is a matrix version 
of the generalized Hoeffding inequality (i.e., McDiarmid's inequality (p[5]))- Let Amax(') 
and Amin(0 return the largest and smallest eigenvalue of a matrix, respectively. 
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Theorem 3. (Matrix bounded differences, Corollary 7.5, 126]) Let {X^ : i = 1,2, . . . , N} be 
an independent family of random variables, and let V be a function that maps N variables 
to a self-adjoint matrix of dimension d. Consider a sequence of {Ck} of fixed self-adjoint 
matrices that satisfy 

{V{x\...,x\...,x'')-V{x\...,x\...,x'')f <Cl 

where x^ and x^ range over all possible values of for each index i. Compute the variance 
parameter 



Then, for all 6 > 0, 



^{Amax(^^(x) - E[V{^)]) >5}< dexp 



where x = 

Now we proceed to the formal proof of Lemma [2] Recall that can be written as 
= - V,~'){Ej,^[T{X)] - Eg^iTiX)]) + V,-\Ep^[T{X)] - Ep^[T{X)]). (26) 



To bound the first term on the right-hand-side in ( 26 ) , we notice that since ^ and Vf^ ^ 
are both positive definite and (e~^/— V^~^) and (e~^/— V^~^) are both positive semi-definite, 
we have 

-1,, 



\v^'-yk 'W = \\v,\v,-v,)v,'\\ 

< \\Vu^\\\\yk-Vk\\\\%^\ 



(27) 



To establish a bound on || — Vfc||, we use the matrix bounded differences inequality that 
is introduced above. For simplicity of exposition, we drop the subscript k in the expression 
below. 

2 

l-rr/l 1 /V\ -rr/l ~1 /V\l 

sup 



..,x\...,x^)-V{x'^,...,x' 



\v{x^ 

^2 sup I [T{x')T{xY -T{ff)nx') 

^ x\x^eX I 

-j;^T.^{x^Hnx')-nx^)) 



.,x-)y 

^5](r(x^)-r(s^))r(xT- 
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where C is a fixed positive semidefinite matrix. This last inequahty is due to Assump- 
tion [ijiv) that r(x) is bounded on X. Note that conditioning on Fk-i, {x\, i = 1, . . . , N^} 
are i.i.d., and EQ^\Vk\J'k-i\ = Vk- Then according to the matrix bounded differences in- 
equahty, for all (5 > 0, 

P [Xr^M -Vk)>5 < dexp (^^) , 



which also implies 



p{-\rnUVk -Vk)>5 \Fk-i} = p[\m.^{Vk -Vk)>5 \Fk-i} < d 



exp 



-Nk5^ 
8||C||2 



Recall that for a symmetric matrix A, \\A\\2 = max(Aniax(^), — Amm(^)) and \\A\\ < \\A\\2. 
Hence, 



P{\\Vk-Vk\\>5 <p[\\Vk-Vkh>5 \Fk-i] < 2d exp 

Recall that for any nonnegative random variable X, 

/•oo 

E[X] = / P{X>x)dx 
Jo 

/•oo 

< a+ P{X >x)dx. 

J a 

So we have 



-Nk6^ 



E 



\\Vk-Vkf \Fk-i 



< a+ I P\\\Vk-Vk\\>Vx\Tk}dx 

-Nkx' 



< a + 



2(iexp 



ICII 



dx. 



Set a = 8||C||2 log (2(i)/A''fc, and we obtain 



E 



I Vfe — Vfcll \Tk-i 



< E 



\\Vk-Vkf l-Ffc-i 



< 



j||C||2(l + log(2d)) 
Nk 



(28) 



To bound the second term in the right-hand-side of (26), notice that Epi^[Tj{X)] is 
a self-normalized importance sampling estimator of Ep^[Tj{X)], where Tj[X) is the j*^ 
element in the vector T{X). Applying Theorem 9.1.10 (pp. 294, [2j), we have 



E 



\Ep,[T,{X)]- Ep,[T,{X)]\''\Tk- 
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where Cj's are positive constants due to the boundedness of Tj{x) on X. Hence, 



E 



\Ep^[T{X)]-Ep^[T{X)]\\\J^,_ 



< E[\\Ep,[T{X)]-Ep,[T{X)]f\Tk^ 

d 

< 5^i^[|^pjr,-(x)]-i?pjT,-(x)]|V; 



< 



d max,- c 



■J ^3 



(29) 



Putting (28) and (29) together, we obtain 



E[\M] < E \e''\\Vk-Vk\\\\Ep^[T{X)]-Ee,[T{X)]\\ + \\V,-'\\\\Ep^[T{X)] - E,^[T{X)] 



< Me-'^E 



E 



— Vfciii-Ffc-i 



+ e-^E 



E 



Ej,^[T{X)]-E,^[T{X)]\\\F^_ 



< 



Me-V8||C||2(l + log(2d)) + e-V^™axjCj 



where the positive constant M is due to the boundedness of T(x) on X. 
Therefore, for any T > 



E 



.i=k 



i=k 

oo 



i=k 

oo 



< c 



i=k 



1 



+ 



+ 



—;dx 

k 2;P 

1 1 



1 



where the first hne follows from the monotone convergence theorem, and the third line 
follows from Assumption [Tj^ii). For any r > 0, we have from Markov's inequality 



\i=k 



E[TZk^^U^ 
T 



< 



C ( 1 



+ 



1 1 



-1 



as /c — )• oo. 



25 



where the last statement is due to /3 > 1. This result of convergence in probability to- 
gether with the fact that the sequence {YliZk monotone implies that the sequence 
{Zli^fc "illCill} converges w.p.l as A; oo. Furthermore, since sup|„.o<^n-i „.<r} || I]"=fc aj^ill < 
suP{„:o<j:^r,i ai<T} T,7=k (^i\M < EZk ^Mil conclude that {sup|„^o<j^n-^i ^^^^y \\ J27=k ^^i^i 
converges to w.p.l as — )■ oo. □ 

Proof. Theorem [ij To show our theorem, we apply Theorem 2.1 in jllj . The condition on 
the step size sequence in their theorem is satisfied by our Assumption [T]^i) , and condition 
(2.2) there is a result of Lemma [2j Thus, to establish convergence, it is sufficient to show 
6fc — >• w.p.l as A; — oo. Note that 

hk = V^'{%,[T{X)]-E^,[T{X) 



^ \Nk Yk Yk Vfe 



Hence, 



YkYk ^ Yk 



ll^fcll < " '" |Vfc-Vfc| + ^^||Ufc-Ufc|| 



< 



iVk 



\YkYk\ iVfcl 
II 



\v-^\\ 1 ^ 

\^k\ k ^^-^ 



Since T[x) is bounded, it is easy to see that ^|^^ is also bounded. Furthermore, note 

that is bounded and |Vfc| is bounded away from zero. This together with Assump- 

tion [Tj^iv) imply that the sequence {hk} converges to zero w.p.l. □ 



Proof. Lemmajsj Under Assumption 1, we know that the sequence {Ok} converges w.p.l. 
to a limiting point 9* . This, together with Assumption [T|iii) , implies that the sequence of 
sampling distributions {/(x; 6k)} will converge point-wise in x to a limiting distribution 
/(x;^) w.p.l. Notethat ||Vare,(T(X))-Var0*(T(X))|| < \\V^ve,{T{X))-\sxe,{T{X))\\ + 
\\\aie^{T{X)) — \aig*{T{X))\\. Clearly, the first term converges to zero by the strong 
consistency of the variance estimator. On the other hand, using the point-wise convergence 
of {/(•; Ok)} and the dominated convergence theorem, it is easy to see that the second term 
also vanishes to zero. This shows <I>fc — )• $ w.p.l. Thus, the convergence of F^ to F is a 
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direct consequence of the continuity assumption of Jc in the neighborhood of 6* . Regarding 
Tfc, we have 



TkA + Tfc 2, 



where Tk,i = ^^^^.u^ 
Note that 



\\Tk,i\\ < \\<^k 



^ A:-/2$;,Mi and T^.s = W^^^l^^,^ 



1=1 



l^fcll A:"' 



/2 ^fe 



J^l4(i/(xy)-5,,(//(xl))|||r(x^,)|| 



(30) 



1=1 



Since T{x) is bounded, it is easy to see that 



UM 



is also bounded. Furthermore, note 



that |Vfc| is bounded away from zero. This, together with the boundedness of \\^k\\ and 
Assumption 3, imply that the right-hand-side of (30) converges to zero w.p.l. 

For term Tk^2, let and be the zth components of Ufc and Ufc, respectively. By 

using a second order two variable Taylor expansion of ^ around wj^, we have 

Vfc 



IV IV 1 IV IV 1 

^ = + — (Ui - Ul) - ^(V, - V,) + ^(Vfc - Ykf - ^(Ui - Ul)(V, - V,), 

Vfc Vfc Vfc Yl 



where UJ, and Vfc are on the line segments from to \]\ and from Vfc to Vfc. Taking 
conditional expectations at both sides of the above equation, we have 



Vfc 



Vfc 



l^^l(Vfc-Vfc)2 



(Vfc - Vfc)2 



+ Ee,\^m-l]l){Yk-Yk)\ 



fc-i 



l(ui-uy(Vfc-Vfc)| 



(31) 



for constants Ci > and C2 > 0. Thus, a straightforward calculation shows that the right- 
hand-side of (31) is 0{N^^). Consequently, we have Tfc^2 — ^ w.p.l. as /c — )• 00 by taking 
Nk = Nok'^ with C > T"/2. This shows Tfc w.p.l. as desired. □ 
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Proof. Lemma[4| Eg^, [Wk\J-k-i] = follows directly from the definition of W^- Again, we 
let and be the ith components of Ufc and Ufc, let Tj(x) be the ith component of the 
sufficient statistic T(a;), and define Sf^- as the (i,j)th entry of the uiatiiK. E0^[WkW^\Tk-i]- 

By using a first order two variable Taylor expansion of ^ around y^, we have 



+ 



1 



1 



Wk Vfc Vfe 
rm. 



(u^-ul)-^(Vfe-v,) + 7^fc, 

k 



where TZk is a reminder term. Therefore, T,fj can be expressed as 



fc-1 



k TP 





fe-1 



--k^-'^-^Ee.m-UDiUi-UDlTk-i] 

k 



k^--^Ee,m - Uy(V, - Vfc)|J-,_i] 



1 



IT* IT-? 



where "^fc represents a higher-order term. 



1 .i?ej5|,(^(x))T,(x)r,(x)|j-,_i] 



E; 



Pk 



UX)T,{X) 



El[Se,{H{X))] 
Pk{X) 



f{X;9k) 



E,^[T,{X)]E,^[T,{X)] 



By using a similar argument, it can be seen that 



A;" 



E,^[T,iX)]Ep^ [t,{X) 



f{X;9k) 
Pk{X) 



Ep,[UX)]Ep,[T,iX)]E^^ 



f{X;dk 

Pk{X) 



Pk 



f{X;ek 



Ep^[Ti{X)]Ep^[T,iX)] 
E,^[Ti{X)]Ep^[T,{X)] 

- Ep^[T,{X)]Ep^[T,{X)] 



(32) 
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Therefore, 



[in] + [iv] + F-°7^fc 

mix) - Ep^[TiiXmT,{X) - E,^[T,{X)]) 
mix) - Ep^miX)])iT,iX) - Ep^[T,iX)]) 



PkiX) 
fiX;ek) 
PliX) 



PiX;9k 



By taking Nf^ = Nok'^^"', it can be shown that the higher-order term k'^~"'1Zk is o(l). In 
addition, since Seiy) is continuous in 9 for a fixed y, the point-wise convergence of /(•; ^a:) 
to fi-',9*) imphes that Pkix) wih also converge in a point-wise manner to a hmiting distri- 
bution Thus, the dominated convergence theorem suggests that S^^- will converge 
to 



CEe^ 



iT,iX) - E,,[T,iX)])iT,iX) - Ep,[T,iX)] 



pliX) 



Pix-e* 

for some positive constant C. Therefore, the limiting matrix S is given by 

P*iX) 



= Cove*{iTiX) - Ep,[TiX)]) 



fiX-0*)J' 



where Cov0*i-) is the covariance matrix with respect to 

To show the last statement, we use Holder's inequality and write 

1 

lim E[I{\\Wkf > rk'^nWkf] < limsup > rfe")j ' \E[\\Wk\\' 



(33) 



Note that 



P{\mf > rk^) = P{\\Wk\\ > V^k"/') 



E\\\Wkf] 

< by Chebyshev's inequality 

rk"' 

^ E[Ee,[\mf\Tk-i]] 

rk°^ 
_ S[tr(S^)] 

= 0(A;^") 

by taking = Nok'^~'^ for k sufficiently large, where the last step follows because all 
entries in T,^ are bounded and thus convergence w.p.l. implies convergence in expectation. 
On the other hand, by (32), can be expressed in terms of the fourth order central 

moments of the sample mean and it can be verified that = 0(1)- This shows 

that the right-hand-side of (33) is Oik~^), which vanishes to zero as k ^ oo. □ 
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Figure 1: Comparison of GASS, GASS_avg, modified CE and MRAS 
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Figure 2: Comparison of GASS, GASS_avg, modified CE and MRAS 
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